Getting HTML with Pycurl


Problem description


I've been trying to retrieve a page of HTML using pycurl so that I can parse it for relevant information using str.split and some for loops. I know pycurl retrieves the HTML, since it prints it to the terminal. However, if I try something like

html = str(c.perform())  

the variable just holds the string "None".

How can I use pycurl to get the HTML, or redirect whatever it sends to the console so that it can be used as a string as described above?

Thanks a lot to anyone who has any suggestions!

‑‑‑‑‑

Reference solutions

Method 1:

This sends a request, then stores and prints the response body:

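from io import BytesIO  # Python 3; the original Python 2 answer used StringIO.StringIO
import pycurl

url = 'http://www.google.com/'

storage = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
# pycurl calls this function with each chunk of the response body (bytes)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
content = storage.getvalue()
print(content)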
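
On Python 3, pycurl hands the body to the write callback as bytes, so the buffer contents need decoding before str.split can be used with string arguments. A minimal sketch, assuming the page is UTF-8 encoded (an assumption; the real charset is normally announced in the Content-Type header). The helper name body_to_text and the sample bytes are hypothetical stand-ins for storage.getvalue():

```python
def body_to_text(raw, encoding='utf-8'):
    """Decode a response body (bytes) into str, replacing undecodable bytes."""
    return raw.decode(encoding, errors='replace')


# Stand-in for storage.getvalue(); real content would come from the request.
raw = b'<html><head><title>Example page</title></head><body></body></html>'
html = body_to_text(raw)

# The str.split approach mentioned in the question, applied to the title tag.
title = html.split('<title>', 1)[1].split('</title>', 1)[0]
print(title)
```

Using errors='replace' keeps the decode from raising on a stray byte, at the cost of silently substituting a replacement character.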

If you also want to store the response headers, use:

c.setopt(c.HEADERFUNCTION, storage.write)
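
Because the line above reuses storage.write, the headers land in the same buffer as the body, ahead of it. If they should be kept apart, one option is a separate header callback. The following is a rough sketch, not part of the original answer: make_header_collector and fetch are hypothetical names, and decoding header lines as ISO-8859-1 is an assumption based on common HTTP practice.

```python
from io import BytesIO


def make_header_collector(headers):
    """Return a callback that stores each 'Name: value' header line in `headers`."""
    def on_header(line):
        # pycurl passes each header line to the callback as bytes.
        text = line.decode('iso-8859-1').strip()
        if ':' in text:  # skips the status line and the blank terminator line
            name, value = text.split(':', 1)
            headers[name.strip().lower()] = value.strip()
    return on_header


def fetch(url):
    """Fetch `url` and return (headers dict, body bytes)."""
    import pycurl  # imported here so the collector above works without pycurl

    headers = {}
    body = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, body.write)
    c.setopt(c.HEADERFUNCTION, make_header_collector(headers))
    try:
        c.perform()
    finally:
        c.close()
    return headers, body.getvalue()
```

A call such as headers, body = fetch('http://www.google.com/') would then give the headers as a dict keyed by lower-cased name. If a header repeats, the later value overwrites the earlier one in this sketch.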

Method 2:

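The perform() method executes the fetch and passes the response body to a write function you specify. It returns None itself, which is why str(c.perform()) yields "None". You need to provide a buffer to hold the HTML and a write function. Usually this is done with an in-memory buffer: a StringIO object on Python 2, or io.BytesIO on Python 3, where pycurl delivers the body as bytes:

import pycurl
from io import BytesIO  # on Python 2 the original answer used StringIO.StringIO

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.google.com/")

b = BytesIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)  # called with each chunk of the body
c.setopt(pycurl.FOLLOWLOCATION, 1)       # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)            # but give up after 5 of them
c.perform()
c.close()
html = b.getvalue()  # bytes; call .decode() to get a str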

You could also use a file or tempfile or anything else that can store data.
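
As a sketch of the file variant, the body can be written straight to disk by handing pycurl an open binary file through the WRITEDATA option. The function name fetch_to_file and its parameters are hypothetical, chosen only for illustration:

```python
def fetch_to_file(url, path):
    """Download `url` and write the response body to the file at `path`."""
    import pycurl  # imported lazily; requires pycurl to be installed

    # The file must be opened in binary mode because pycurl writes bytes.
    with open(path, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, url)
        c.setopt(c.FOLLOWLOCATION, True)  # follow redirects, as in the answer
        c.setopt(c.WRITEDATA, f)          # pycurl calls f.write for each chunk
        try:
            c.perform()
        finally:
            c.close()
```

This avoids holding the whole page in memory, which mainly matters for large downloads. For a page that will be parsed right away, the in-memory buffer is usually simpler.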

(by Sinthet, Corey Goldberg, MakeSomething)

References

  1. Getting HTML with Pycurl (CC BY‑SA 3.0/4.0)






